Biologically Inspired Oscillating Activation Functions Can Bridge the Performance Gap Between Biological and Artificial Neurons¶
These are notes summarizing and offering implementations of the ideas presented in this paper.
import math
from matplotlib import pyplot as plt
import numpy as np
import random
from IPython.display import display, Math
General Reasoning¶
It was recently discovered that humans have "pyramidal" neurons capable of learning the XOR function with just a single neuron. This property can be re-created in artificial neurons by replacing the typical activation functions (sigmoid, ReLU, tanh, etc.) with "oscillating activation functions".
With those conventional activation functions, a single neuron can't learn the XOR function; you need a multi-layer network (hidden layers feeding an output layer).
Because these neurons can have more than one hyperplane in their decision boundary, they are expected to do better on classification problems.
Oscillating Activation Functions¶
Oscillating activation functions have more than one zero, which allows a single neuron to have multiple hyperplanes in its decision boundary.
Solving XOR requires at least 2 hyperplanes.
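To see where the extra hyperplanes come from, take an activation with several zeros, like the cubic $z - z^3 = z(1-z)(1+z)$ defined just below. A single neuron computing $f(w_1x_1 + w_2x_2 + b)$ changes sign exactly where its pre-activation crosses a zero of $f$, so its decision boundary is the set of parallel hyperplanes $$w_1x_1 + w_2x_2 + b \in \{-1,\ 0,\ 1\}$$ which is already enough for XOR. A monotonic activation has at most one zero, hence only one hyperplane.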
The following three oscillating activation functions were presented: the "Shifted Quadratic Unit" (SQU), the "Non-Monotonic Cubic Unit" (NCU), and the "Shifted Sinc Unit" (SSU).
SQU = lambda z: z + z**2                         # Shifted Quadratic Unit
NCU = lambda z: z - z**3                         # Non-Monotonic Cubic Unit
sinc = lambda z: 1 if z == 0 else math.sin(z)/z  # unnormalized sinc helper
SSU = lambda z: math.pi*sinc(z - math.pi)        # Shifted Sinc Unit
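A quick plot of the three (using the same np.vectorize trick as the DSU plot later) makes the extra zero crossings easy to see:
# Plot the three activations near the origin; each crosses zero more than once
z = np.linspace(-4, 4, 400)
for name, fn in [("SQU", SQU), ("NCU", NCU), ("SSU", SSU)]:
    plt.plot(z, np.vectorize(fn)(z), label=name)
plt.axhline(0, color="gray", linewidth=0.5)
plt.legend()
plt.show()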
Making Activation Functions Useful¶
Not all activation functions with the above property are useful. To be useful, an activation function must also be more-or-less linear for small values: models usually start with small weights, and we want the derivative near the origin to be close to 1 so the neuron has a chance to learn quickly. For that reason the following don't work well:
- $\cos(z)$
- $z^2$
- $z^3$
These functions can be made useful simply by shifting them or adding a linear term so that they have a derivative of 1 at the origin:
- $\sin(z)$
- $z^2 + z$
- $z - z^3$
A benefit of the network being linear for small values is that it won't immediately use its non-linear features, which helps avoid overfitting.
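A quick numerical check makes the point: the slope at the origin is roughly 0 for the first group and roughly 1 for the fixed versions (central differences below, applied to the functions listed above).
# Central-difference slope at z = 0 for the "bad" functions and their fixed counterparts
slope_at_0 = lambda f, h=1e-5: (f(h) - f(-h)) / (2*h)
for name, f in [("cos(z)", math.cos), ("z^2", lambda z: z**2), ("z^3", lambda z: z**3),
                ("sin(z)", math.sin), ("z^2 + z", SQU), ("z - z^3", NCU)]:
    print(f"{name:>8}: slope at 0 ~ {slope_at_0(f):.3f}")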
Results¶
The paper's results show the oscillating activation functions outperforming the conventional ones!
Afterthoughts¶
Correlation To Fractal Noise¶
Curiously, the second-best-performing activation function is reported to be the Decaying Sine Unit (DSU), given as:
DSU = lambda z: (math.pi/2) *( sinc( z - math.pi) - sinc(z + math.pi))
_dsu = np.vectorize(DSU)
x = np.linspace(-50, 50, 100*5)
y = _dsu(x)
plt.plot(x, y)
plt.show()
A lot of procedural texture generation (and terrain generation) uses fractal noise functions, and I can't help but wonder if this helps filter the images into those sorts of components.
Re-Implementation¶
Quick re-derivation of SGD using MSE on Oscillating Functions¶
We're quickly gonna set up the gradient equations for using MSE with a few of these functions. We'll be using a linear-combination model (carrying a bias $b$, as in the code below) of the form: $$\mathrm{model}(x_1, x_2) = \mathrm{SQU}(w_1x_1 + w_2x_2 + b)$$ $$loss = (\mathrm{model}(x_1, x_2) - y)^2$$ $$loss = (\mathrm{SQU}(w_1x_1 + w_2x_2 + b) - y)^2$$
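Before grinding through the symbolic algebra, note that the chain rule already gives the general shape of these gradients for any activation $f$: $$\frac{\partial\,loss}{\partial w_i} = 2\big(f(z) - y\big)\,f'(z)\,x_i, \qquad \frac{\partial\,loss}{\partial b} = 2\big(f(z) - y\big)\,f'(z), \qquad z = w_1x_1 + w_2x_2 + b$$ For SQU this is just $2(z + z^2 - y)(1 + 2z)\,x_i$, since $\mathrm{SQU}'(z) = 1 + 2z$; the symbolic engine below mostly serves to double-check the messier DSU case.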
DSU¶
# Symbolic variables; these cells use a CAS (SageMath-style var/diff/simplify/latex) rather than plain Python
var('x_1 x_2 theta_1 theta_2 y b')
(x_1, x_2, theta_1, theta_2, y, b)
hypo_1 = theta_1*x_1 + theta_2*x_2 + b
# DSU, matching the Python definition above (the pi/2 factor multiplies both sinc terms)
hypo_2 = (pi/2) * ((sin(hypo_1 - pi) / (hypo_1 - pi)) - (sin(hypo_1 + pi) / (hypo_1 + pi)))
j = (hypo_2 - y)**2
f = diff(j, theta_2)
Math(latex(simplify(f)))
That's uhh... pretty intense. Let's try SQU instead.
SQU¶
hypo_2 = hypo_1 + hypo_1**2 # This is the SQU part
j = (hypo_2 - y)**2
f = diff(j, theta_1)
simplify(f)
2*((theta_1*x_1 + theta_2*x_2 + b)^2 + theta_1*x_1 + theta_2*x_2 + b - y)*(2*(theta_1*x_1 + theta_2*x_2 + b)*x_1 + x_1)
simplify(diff(j, theta_2))
2*((theta_1*x_1 + theta_2*x_2 + b)^2 + theta_1*x_1 + theta_2*x_2 + b - y)*(2*(theta_1*x_1 + theta_2*x_2 + b)*x_2 + x_2)
simplify(diff(j, b))
2*((theta_1*x_1 + theta_2*x_2 + b)^2 + theta_1*x_1 + theta_2*x_2 + b - y)*(2*theta_1*x_1 + 2*theta_2*x_2 + 2*b + 1)
Cleaned up, the gradient with respect to $\theta_1$ is $$\frac{\partial J}{\partial \theta_1} = 2\big((\theta_1x_1 + \theta_2x_2 + b)^2 + \theta_1x_1 + \theta_2x_2 + b - y\big)\big(2(\theta_1x_1 + \theta_2x_2 + b)\,x_1 + x_1\big)$$
NCU¶
Okay, Mr. TearGosling Sir said that NCU was prettier, so let's have a look at it.
hypo_1 = theta_1*x_1 + theta_2*x_2
hypo_2 = hypo_1 - hypo_1**3 # This is the NCU part
j = (hypo_2 - y)**2 # recompute the loss for the NCU hypothesis
Math(latex(diff(j, theta_2)))
Well these last two aren't so bad, and apparently NCU works best. Somehow I missed it when I read the list.
f = diff(j, theta_1)
Math(latex(simplify(f)))
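Written compactly (a hand-derived check against the symbolic output, using the bias-free model above and $\mathrm{NCU}'(z) = 1 - 3z^2$): $$\frac{\partial\,loss}{\partial \theta_i} = 2\big(z - z^3 - y\big)\big(1 - 3z^2\big)\,x_i, \qquad z = \theta_1 x_1 + \theta_2 x_2$$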
Actually Doing SGD Now¶
Now that we've done a little bit of leg-work with the math above, we can implement a simple version of SGD for this new activation function.
X_control = np.random.uniform(size=[100, 2])
# Linear target: 0.25*x1 + 0.25*x2 plus a constant offset (taken from the first sample)
Y_control = np.sum(0.25*X_control, axis=1) + X_control[0][0]
The paper claimed to use mean-squared error and stochastic gradient descent to optimize this, so we'll try that too.
def avg_loss(w: np.array, Xs: np.array, Ys: np.array):
    s = 0
    for x, y in zip(Xs, Ys):
        pred = SQU(w[0]*x[0] + w[1]*x[1] + w[2])
        s += (y - pred)**2
    return s / Ys.shape[0]
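(As an aside, the same loss can be computed without the Python loop. `avg_loss_vec` below is just a name for a vectorized sketch that applies SQU elementwise; it's not from the paper.)
def avg_loss_vec(w, Xs, Ys):
    z = Xs @ w[:2] + w[2]     # linear part for every sample at once
    preds = z + z**2          # SQU applied elementwise
    return np.mean((Ys - preds)**2)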
def squ_mse_gradient(w: np.array, x: np.array, y: float):
    l = w[0]*x[0] + w[1]*x[1] + w[2]
    # Gradients copied from the symbolic results above (note ** here, not Sage's ^)
    g0 = 2*((l)**2 + l - y)*(2*(l)*x[0] + x[0])
    g1 = 2*((l)**2 + l - y)*(2*(l)*x[1] + x[1])
    g2 = 2*((l)**2 + l - y)*(2*w[0]*x[0] + 2*w[1]*x[1] + 2*w[2] + 1)
    # Without bias
    #g0 = 2*((l)**2 + l - y)*(2*(l)*x[1] + x[1])
    #g1 = 2*((l)**2 + l - y)*(2*(l)*x[0] + x[0])
    return np.array([g0, g1, g2])
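Hand-copied gradients are easy to get subtly wrong, so here's a quick finite-difference check (the `check_gradient` helper is mine, not from the paper); the two arrays it returns should agree to several decimal places.
def check_gradient(w, x, y, h=1e-6):
    # Compare the analytic gradient against central differences of the single-sample loss
    loss = lambda w_: (SQU(w_[0]*x[0] + w_[1]*x[1] + w_[2]) - y)**2
    numeric = np.zeros(3)
    for k in range(3):
        wp, wm = w.copy(), w.copy()
        wp[k] += h
        wm[k] -= h
        numeric[k] = (loss(wp) - loss(wm)) / (2*h)
    return numeric, squ_mse_gradient(w, x, y)

check_gradient(np.array([0.3, -0.2, 0.1]), np.array([0.5, 0.7]), 0.4)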
def fit(X, Y, iterations=200, lr=0.001):
    w = np.array([1, 1, 1], dtype=np.float64)
    for _ in range(iterations):
        i = random.randint(0, Y.shape[0]-1)
        x, y = X[i], Y[i]
        a = squ_mse_gradient(w, x, y)
        w -= lr * a
    return w
weights = np.array([1, 1, 1], dtype=np.float64)
avg_loss(weights, X_control, Y_control)
31.4772287225186
weights = fit(X_control, Y_control)
avg_loss(weights, X_control, Y_control)
0.07734606914918682
Looks like it's working.
Fitting XOR¶
The paper claims this activation lets a single neuron (a lone perceptron) learn the XOR operation, something the standard activation functions can't do. We'll test this out with our new SGD implementation.
# Gotta make sure to shift the inputs and targets from {0, 1} to {-1, +1} so everything is centered on the origin
XOR = np.array([ [-1, -1], [-1, 1], [1, -1], [1, 1]])
YOR = np.array([ -1, 1, 1, -1])
weights = np.array([1, 1, 1], dtype=np.float64)
avg_loss(weights, XOR, YOR)
43.0
weights = fit(XOR, YOR, iterations=2000)
avg_loss(weights, XOR, YOR)
0.49727671369496174
for i, ii in zip(XOR, YOR):
    print(i, '->', SQU(weights[0]*i[0] + weights[1]*i[1] + weights[2]) > 0)
[-1 -1] -> False
[-1  1] -> True
[ 1 -1] -> True
[1 1] -> False
Tadah! Guess they weren't lying.
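As a sanity check on the other half of the claim: for any monotonic activation, thresholding $f(w_1x_1 + w_2x_2 + b)$ is the same as checking which side of a single hyperplane you're on, so a crude sweep over weights is enough to confirm that no single linear boundary gets all four XOR points right.
# Brute-force a grid of (w1, w2, b); a single hyperplane never classifies all 4 XOR points
grid = np.linspace(-3, 3, 25)
best = 0
for w1 in grid:
    for w2 in grid:
        for b in grid:
            preds = (w1*XOR[:, 0] + w2*XOR[:, 1] + b) > 0
            best = max(best, int(np.sum(preds == (YOR > 0))))
print("Best a single hyperplane manages on XOR:", best, "of 4")  # expect 3, never 4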